ABSTRACT

We report experimental results on speech-driven text retrieval,
which allows users to retrieve information in multiple domains
with spoken queries. Since users speak about contents related to
a target collection, we produce the language models used for speech
recognition from the target collection, so as to improve both
recognition and retrieval accuracy. Experiments using existing
test collections combined with dictated queries showed the
effectiveness of our method.



1. INTRODUCTION 

Automatic speech recognition, which decodes human voice to
generate transcriptions, has recently become a practical technology.
It is now feasible to use speech recognition in real-world
computer-based applications, specifically those associated with human
language. In fact, a number of speech-based methods have been
explored in the information retrieval (IR) community, which can be
classified into the following two fundamental categories:

• spoken document retrieval, in which written queries are used 
to search speech (e.g., broadcast news audio) archives for 
relevant speech information [|IJ] . 

• speech-driven retrieval, in which spoken queries are used to 
retrieve relevant textual information [^|, ^]. 

Initiated partially by the TREC-6 spoken document retrieval 
(SDR) track 111], various methods have been proposed for spoken 
document retrieval. However, relatively few methods have been
explored for speech-driven text retrieval, although they are
associated with numerous keyboard-less retrieval applications,
such as telephone-based retrieval, car navigation systems,
and user-friendly interfaces.

Barnett et al. [^] performed comparative experiments related 
to speech-driven retrieval, where the DRAGON speech recognition
system was used as an input interface for the INQUERY text
retrieval system. They used as test inputs 35 queries collected from
the TREC topics and dictated by a single male speaker. Crestani [^] 
also used the above 35 queries and showed that conventional
relevance feedback techniques marginally improved the accuracy for
speech-driven text retrieval.

The above studies focused solely on improving text retrieval
methods and did not address the problem of improving speech
recognition accuracy. In fact, an existing speech recognition system
was used with no enhancement. In other words, the speech recognition
and text retrieval modules were fundamentally independent and were
simply connected by way of an input/output protocol.

However, since most speech recognition systems are trained
on specific domains, the accuracy of speech recognition
across domains is not satisfactory. Thus, as can easily be predicted,
in the cases of Barnett et al. [^J] and Crestani, the speech
recognition error rate was relatively high and considerably decreased
the retrieval accuracy. Additionally, highly accurate speech
recognition is crucial for interactive retrieval, such as
dialog-based retrieval.

Motivated by these problems, in this paper we integrate (rather
than simply connect) speech recognition and text retrieval to improve
both recognition and retrieval accuracy in the context of
speech-driven text retrieval.

Unlike general-purpose speech recognition, which aims to decode
any spontaneous speech, in speech-driven text retrieval
users usually speak about contents associated with a target
collection, from which documents relevant to their information need
are retrieved. In a stochastic speech recognition framework, the
accuracy depends primarily on acoustic and language models [Q].
While acoustic models are related to phonetic properties, language
models, which represent the linguistic contents to be spoken, are
related to target collections. Thus, it is intuitively reasonable
to produce language models from target collections.

To sum up, our belief is that by adapting a language model
to a target IR collection, we can improve speech recognition
and text retrieval accuracy simultaneously.

Section 2 describes our speech-driven text retrieval system,
which is currently implemented for Japanese. Section 3 elaborates
on comparative experiments, in which IR test collections in
different domains are used to evaluate the effectiveness of our system.



2. SYSTEM DESCRIPTION 

2.1. Overview 

Figure 1 depicts the overall design of our speech-driven text
retrieval system, which consists of speech recognition and text
retrieval modules. In the following sections, we explain the two
modules in Figure 1 in turn.



The first and second authors are also members of CREST, Japan
Science and Technology Corporation.
Fig. 1. The design of our speech-driven text retrieval system.



2.2. Speech Recognition 

For the speech recognition module, we use the Japanese dictation
toolkit, which includes the "Julius" recognition engine and
acoustic/language models. Julius performs a two-pass
(forward-backward) search using word-based forward bigrams and
backward trigrams on the respective passes.

The acoustic model was produced from the ASJ speech
databases of phonetically balanced sentences (ASJ-PB) and
newspaper article texts (ASJ-JNAS) [[|, which contain approximately
20,000 sentences uttered by 132 speakers of both genders.
We used a 16-mixture Gaussian distribution triphone
Hidden Markov Model, where states were clustered into 2,000
groups by a state-tying method.

This toolkit also includes development software, so that
acoustic and language models can be produced and replaced depending
on the application. While we use the acoustic model provided
in the toolkit, we use new language models produced from the
source documents (i.e., target IR collections).
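
The idea of producing a language model from the target collection can be
illustrated with a minimal sketch. It is not the toolkit's software: it
assumes the documents are already segmented into words (for Japanese, by a
morphological analyzer), keeps the most frequent word types as the
vocabulary, and uses add-one smoothing as a simple stand-in for the
discounting a production toolkit would apply. The function name
build_bigram_model and the symbols <UNK>, <s>, and </s> are hypothetical
choices made for this illustration.

```python
from collections import Counter


def build_bigram_model(sentences, vocab_size=20000, unk="<UNK>"):
    """Estimate add-one-smoothed bigram probabilities from tokenized text.

    `sentences` is a list of word lists taken from the target collection.
    Add-one smoothing is an assumption of this sketch, not the toolkit's method.
    """
    # Keep the most frequent word types; all other words map to `unk`.
    freq = Counter(w for s in sentences for w in s)
    vocab = {w for w, _ in freq.most_common(vocab_size)}
    vocab.update({unk, "<s>", "</s>"})

    history_counts, bigram_counts = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + [w if w in vocab else unk for w in s] + ["</s>"]
        history_counts.update(words[:-1])            # every word that has a successor
        bigram_counts.update(zip(words, words[1:]))  # adjacent word pairs

    def prob(prev, word):
        """P(word | prev) with add-one smoothing over the vocabulary."""
        prev = prev if prev in vocab else unk
        word = word if word in vocab else unk
        return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))

    return vocab, prob
```

Calling build_bigram_model once per collection, with the same vocab_size,
yields two models that differ only in their source documents.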



2.3. Text Retrieval 

The text retrieval module is based on the "Okapi" method,
which computes a relevance score between the transcribed query
and each document in the collection based on the distribution of
index terms, and sorts the retrieved documents in descending order
of score.
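
As a rough illustration of this kind of ranking, the sketch below scores
documents with the widely used BM25 weighting from the Okapi family. The
exact Okapi variant and parameter values used in our system are not given
here, so the formula, the defaults k1=1.2 and b=0.75, and the function name
bm25_rank are all assumptions of this sketch.

```python
import math
from collections import Counter


def bm25_rank(query_terms, docs, k1=1.2, b=0.75):
    """Rank documents (lists of index terms) against a query using BM25.

    Returns (score, doc_id) pairs sorted by descending score.
    k1 and b are conventional defaults, assumed for this sketch.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))

    scored = []
    for doc_id, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + length_norm)
        scored.append((score, doc_id))
    # Highest score first; ties are broken by document id.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

In a speech-driven setting, query_terms would be the content words
extracted from the recognizer's transcription, so recognition errors on
those words directly change the ranking.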

We use content words extracted from documents as index terms,
and perform word-based indexing. For this purpose, we use the
ChaSen morphological analyzer [8] to extract content words. We
extract terms from transcribed queries using the same method.

3. EXPERIMENTATION 
3.1. Test Collections 

To investigate the performance of our multi-domain speech-driven
retrieval system, we used two different Japanese IR test
(benchmark) collections: the NTCIR and IREX collections. Both
collections, which resemble the one used in the TREC ad hoc retrieval
track, include topics (information needs) and relevance assessments
(correct judgements) for each topic, along with target documents.
However, the two collections cover different domains.

The NTCIR collection [^J] includes 736,166 abstracts collected
from technical papers published by 65 Japanese associations
in various fields. On the other hand, the IREX collection |ic[f|
includes 211,853 articles collected from two years' worth of
"Mainichi Shimbun" newspaper articles.

The NTCIR and IREX collections include 132 and 30 Japanese
topics, respectively, for a sample of which English translations are
also provided. Figures 2 and 3 show example topics in each
collection, which consist of different fields (for example,
descriptions and narratives) tagged in SGML form.

Since neither collection contains spoken queries, we asked
four speakers (two male and two female) to dictate the topics. For
this purpose, we selectively used a specific field, so as to simulate
realistic speech-driven retrieval.

In the case of the NTCIR topics, titles are not informative enough
for retrieval. On the other hand, narratives, which usually consist
of several sentences, are too long to speak. Thus, only descriptions,
which consist of a single phrase or sentence, were dictated
by each speaker, so as to produce four different sets of 132 spoken
queries. In the case of the IREX topics, however, since descriptions
are not informative enough for retrieval, only narratives were
dictated by each speaker, to produce four different sets of 30 spoken
queries.

3.2. Comparative Evaluation 

We compared the performance of the following retrieval methods: 

• text-to-text retrieval, which used written queries, and can 
be seen as the perfect speech-driven text retrieval, 

• speech-driven text retrieval, in which a language model
produced based on the NTCIR collection was used,

• speech-driven text retrieval, in which a language model
produced based on the IREX collection was used.

For the speech-driven text retrieval methods, queries dictated
by the four speakers were used independently, and the final result
was obtained by averaging the results over the speakers.

Although the Julius decoder outputs more than one transcription
candidate for a single utterance, we used only the one with the
greatest probability score. The results did not significantly change
depending on whether or not we used lower-ranked transcriptions
as queries.

The only difference in producing the two language models
(i.e., those based on the NTCIR and IREX collections) is the
source documents. In other words, both language models had
the same vocabulary size (20,000) and were produced with
the same software.

Table 1 shows statistics related to word tokens/types in the two
collections used for language modeling, where the row "Coverage"
denotes the ratio of word tokens contained in the resultant
language model. Most word tokens were covered irrespective
of the collection.
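
The coverage figure can be made concrete with a small sketch. It assumes
that the vocabulary consists of the vocab_size most frequent word types in
the collection, which is a common choice but is an assumption here; the
function name token_coverage is hypothetical.

```python
from collections import Counter


def token_coverage(tokens, vocab_size=20000):
    """Fraction of word tokens whose type is among the most frequent types.

    `tokens` is the non-empty list of all word tokens in a collection.
    Selecting the vocabulary by frequency is an assumption of this sketch.
    """
    freq = Counter(tokens)
    covered = sum(count for _, count in freq.most_common(vocab_size))
    return covered / len(tokens)
```

For example, with five tokens of one type, three of a second, and two of a
third, a two-word vocabulary covers 8 of the 10 tokens.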



1 http://winnie.kuis.kyoto-u.ac.jp/dictation/
2 http://research.nii.ac.jp/~ntcadm/index-en.html
3 http://cs.nyu.edu/cs/projects/proteus/irex/index-e.html
4 In practice, the IREX collection provides only article IDs, which
correspond to articles in the Mainichi Shimbun newspaper CD-ROM
'94-'95. Participants must obtain a copy of the CD-ROMs themselves.
<TOPIC q=0123>
<TITLE>Biofilms</TITLE>

<DESCRIPTION>Are there any documents about the biofilms produced by some microorganisms in 
which chronic diseases are mentioned?</DESCRIPTION> 

<NARRATIVE>Biofilms are thought to occur when microorganisms grow in microcolonies embedded
in the adherent gel surface on tunica mucosa, and teeth, or on catheters, prosthetic valves,
and other artifacts. A relevant document will report on any studies into the relationship
between biofilms produced by some microorganisms and chronic diseases. Documents that
include reports on biofilms produced by non-medical microorganisms that do not cause
infectious diseases are not relevant.</NARRATIVE>
</TOPIC> 



Fig. 2. An English translation for an example topic in the NTCIR collection. 



<TOPIC> 

<TOPIC-ID>1001</TOPIC-ID>

<DESCRIPTION>Corporate merging</DESCRIPTION>

<NARRATIVE>The article describes a corporate merging and in the article, the name of
companies have to be identifiable. Information including the field and the purpose of the
merging have to be identifiable. Corporate merging includes corporate acquisition, corporate
unifications and corporate buying.</NARRATIVE>
</TOPIC> 



Fig. 3. An English translation for an example topic in the IREX collection. 



Table 1. Statistics related to source words for language modeling. 
Each method retrieved the top 1,000 documents, and the TREC
evaluation software was used to calculate non-interpolated average
precision values and plot recall-precision curves.
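
For readers unfamiliar with the measure, non-interpolated average precision
for a single topic can be sketched as below. This is an illustration of the
standard definition, not the TREC evaluation software itself; the function
name average_precision is hypothetical, and the ranked list is assumed to
contain no duplicate document IDs.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=1000):
    """Non-interpolated average precision for one topic.

    Precision is taken at the rank of each relevant document retrieved
    within `cutoff`, and the sum is divided by the total number of
    relevant documents, so unretrieved relevant documents count as zero.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)
```

Averaging this value over all topics, and then over the four speakers,
gives a single figure per retrieval method.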

Table 2 shows the non-interpolated average precision values
(AP) and the word error rate in speech recognition for the different
retrieval methods. As in existing speech recognition experiments,
word error rate (WER) is the ratio between the number of
word errors (i.e., deletions, insertions, and substitutions) and the
total number of words. In addition, we investigated the error rate with
respect to query terms (i.e., keywords used for retrieval), which
we call the "term error rate (TER)". Table 2 also shows trigram
test-set perplexity (PP) and test-set out-of-vocabulary rate (OOV).
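
The WER definition above can be sketched with a word-level edit distance.
The term error rate is not defined in more detail here, so the sketch
assumes one plausible reading: the same computation restricted to query
terms. The function names and the is_term predicate are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words.

    Both arguments are word lists; `reference` is assumed to be non-empty.
    """
    m, n = len(reference), len(hypothesis)
    # dist[i][j]: minimum edits turning reference[:i] into hypothesis[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(n + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[m][n] / m


def term_error_rate(reference, hypothesis, is_term):
    """WER computed only over words for which is_term(word) is true (assumed reading)."""
    ref_terms = [w for w in reference if is_term(w)]
    hyp_terms = [w for w in hypothesis if is_term(w)]
    return word_error_rate(ref_terms, hyp_terms)
```

With this reading, an error on a content word weighs more heavily in TER
than in WER, because correctly recognized function words no longer dilute
the ratio.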

It should be noted that for all the evaluation measures in Table 2
except average precision, smaller values are generally obtained
with better methods. The following observations can be derived from
these results.

First, by comparing the results of different language models, one
can see that the performance was significantly improved with a
language model produced from the target collection, which was
observable irrespective of the domain. Thus, producing language
models based on target collections was quite effective for
speech-driven text retrieval.

Second, in the case of the NTCIR collection, the average
precision for speech-driven retrieval was approximately 77% of
that obtained with text-to-text retrieval, whereas in the case of the
IREX collection, the average precision for speech-driven retrieval
was quite comparable to that obtained with text-to-text retrieval.

Third, TER was generally higher than WER irrespective of the
speaker. In other words, speech recognition was more difficult for
content words than for function words, which were not included in
the query terms.

Finally, we investigated the trade-off between recall and
precision. Figures 4 and 5 show recall-precision curves of the
different retrieval methods for the NTCIR and IREX collections,
respectively. In these figures, the relative superiority in precision
due to the different language models seen in Table 2 was also
observable, regardless of the recall.

4. CONCLUSION 

Aiming at speech-driven text retrieval with high accuracy, we
proposed a method to integrate speech recognition and text
retrieval, in which target text collections are used to produce
statistical language models for speech recognition. We also
showed the effectiveness of our method by way of experiments,
where dictated information needs in the NTCIR/IREX collections
were used as queries to retrieve documents in different domains.
Table 2. Results for different retrieval methods targeting the NTCIR/IREX collections (AP: average precision, WER: word error rate, TER:
term error rate, PP: trigram test-set perplexity, OOV: test-set out-of-vocabulary rate).

Fig. 4. Recall-precision curves for different methods targeting the
NTCIR collection.